AIML module project Part I consists of an industry-based problem statement that can be solved using clustering techniques.
AIML module project Part II consists of designing a synthetic data generation model for a company that has a predesigned dataset.
AIML module project Part III consists of an industry-based problem statement that can be solved using dimensionality reduction techniques.
AIML module project Part IV consists of designing a data-driven ranking model for a sports management company.
AIML module project Part V consists of implementing dimensionality reduction on a multimedia dataset.
Automobile
The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes
The goal is to cluster the data and treat the clusters as individual datasets on which to train regression models to predict 'mpg'. Steps and tasks: [Total Score: 25 points]
#import necessary packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import norm
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
#Load json file
import json
with open('Part1 - Car-Attributes.json') as f:
    datajson = json.load(f)
#read csv file
datacsv = pd.read_csv('Part1 - Car name.csv')
datacsv.shape
datacsv.head()
#Normalize json file
datajson1 = pd.json_normalize(datajson)
print(datajson1)
#convert json to csv file
datajson1.to_csv('df.csv', index=False)
df = pd.read_csv('df.csv')
df.head()
df.shape
#merge both csv files
data = pd.merge(datacsv, df, right_index=True, left_index=True)
data = data.set_index('car_name')
data
data.to_csv('Unsupervised Learning Project - Part 1 Combined Dataset.csv',index=True)
print(data.head())
print(data.index)
print(data.columns)
So there it is: lots of numbers. We can see that the dataset has the following columns (with their types):
data.shape
data.isnull().any()
# There are no null values
data.dtypes
But then, why is horsepower an object and not a float? The values we saw above were clearly numbers. Let's try converting the column using astype().
Let's look at the unique elements of horsepower to look for discrepancies
data.hp.unique()
When we print all the unique values in horsepower, we find a '?' that was used as a placeholder for missing values. Let's remove these entries.
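The notebook drops these rows next. As an alternative sketch, pandas' `to_numeric` with `errors='coerce'` turns any non-numeric placeholder into NaN in one step, which keeps the rows available for later imputation (the series below is hypothetical sample data, not from the dataset):

```python
import pandas as pd

# Hypothetical horsepower values with '?' as a missing-value placeholder
s = pd.Series(['130', '165', '?', '150'])

# errors='coerce' converts non-numeric entries to NaN instead of raising,
# and the resulting series is already float
hp = pd.to_numeric(s, errors='coerce')
print(hp.isna().sum())  # 1 NaN where the '?' was
```

Whether to drop or impute the coerced NaNs is then a separate modeling decision.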
data = data[data.hp != '?']
print('?' in data.hp)
data.shape
data.dtypes
So we see all entries with '?' as a placeholder are removed. However, the horsepower column is still an object type and not float. That is because pandas coerced the entire column to object when we imported the dataset, due to the '?', so let's change that.
data.hp = data.hp.astype('float')
data.dtypes
Now everything looks in order, so let's continue and describe the dataset.
data.describe()
data.mpg.describe()
#The minimum value is 9 and the maximum is 46; the mean is 23.44 with a standard deviation of 7.8
sns.distplot(data['mpg'])
print("Skewness: %f" % data['mpg'].skew())
print("Kurtosis: %f" % data['mpg'].kurt())
Using our seaborn tool we can look at mpg:
In order to do so, let's define a function scale:
def scale(a):
    b = (a - a.min()) / (a.max() - a.min())
    return b
data_scale = data.copy()
data_scale ['disp'] = scale(data_scale['disp'])
data_scale['hp'] = scale(data_scale['hp'])
data_scale ['acc'] = scale(data_scale['acc'])
data_scale ['wt'] = scale(data_scale['wt'])
data_scale['mpg'] = scale(data_scale['mpg'])
data_scale.head()
All our data is now scaled to the same range of [0, 1]. This will help us visualize the data better. We used a copy of the original dataset for this, as we will need the unscaled dataset later when we build regression models.
data['Country_code'] = data.origin.replace([1,2,3],['USA','Europe','Japan'])
data_scale['Country_code'] = data.origin.replace([1,2,3],['USA','Europe','Japan'])
data_scale.head()
var = 'Country_code'
data_plt = pd.concat([data_scale['mpg'], data_scale[var]], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=var, y="mpg", data=data_plt)
fig.axis(ymin=0, ymax=1)
plt.axhline(data_scale.mpg.mean(),color='r',linestyle='dashed',linewidth=2)
The red line marks the average of the set. From the above plot we can observe:
var = 'yr'
data_plt = pd.concat([data_scale['mpg'], data_scale[var]], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=var, y="mpg", data=data_plt)
fig.axis(ymin=0, ymax=1)
plt.axhline(data_scale.mpg.mean(),color='r',linestyle='dashed',linewidth=2)
var = 'cyl'
data_plt = pd.concat([data_scale['mpg'], data_scale[var]], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=var, y="mpg", data=data_plt)
fig.axis(ymin=0, ymax=1)
plt.axhline(data_scale.mpg.mean(),color='r',linestyle='dashed',linewidth=2)
corrmat = data.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, square=True);
factors = ['cyl','disp','hp','acc','wt','mpg']
corrmat = data[factors].corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, square=True);
#scatterplot
sns.set()
sns.pairplot(data, height=2.0, hue='Country_code')
plt.show()
So far, we have explored the data to get a feel for it. We saw the spread of the target variable mpg across the various discrete variables, namely origin, year of manufacture (model year) and cylinders. Now let's extract an additional discrete variable, company name, and add it to the data. We will use regular expressions and the str.extract() function of the pandas DataFrame to make this new column.
data.index
As we can see, the index of the data frame contains the model name along with the company name. The index is in the format 'COMPANY_NAME MODEL VARIANT' (space-separated), so regular expressions make extracting the company names an easy task.
data[data.index.str.contains('subaru')].index.str.replace('(.*)', 'subaru dl')
data['Company_Name'] = data.index.str.extract('(^.*?)\s')
That almost does it, but we can see a NaN, so some text was not extracted; this may be due to a difference in formatting. We can also see that some companies are named differently and there are some spelling mistakes, so let's correct these.
data['Company_Name'] = data['Company_Name'].replace(['volkswagen','vokswagen','vw'],'VW')
data['Company_Name'] = data['Company_Name'].replace('maxda','mazda')
data['Company_Name'] = data['Company_Name'].replace('toyouta','toyota')
data['Company_Name'] = data['Company_Name'].replace('mercedes','mercedes-benz')
data['Company_Name'] = data['Company_Name'].replace('nissan','datsun')
data['Company_Name'] = data['Company_Name'].replace('capri','ford')
data['Company_Name'] = data['Company_Name'].replace(['chevroelt','chevy'],'chevrolet')
data['Company_Name'].fillna(value = 'subaru',inplace=True) ## String methods will not work on null values, so we use fillna()
var = 'Company_Name'
data_plt = pd.concat([data_scale['mpg'], data[var]], axis=1)
f, ax = plt.subplots(figsize=(20,10))
fig = sns.boxplot(x=var, y="mpg", data=data_plt)
fig.set_xticklabels(ax.get_xticklabels(),rotation=30)
fig.axis(ymin=0, ymax=1)
plt.axhline(data_scale.mpg.mean(),color='r',linestyle='dashed',linewidth=2)
data.Company_Name.isnull().any()
var = 'mpg'
data[data[var]== data[var].min()]
data[data[var]== data[var].max()]
var='disp'
data[data[var]== data[var].min()]
data[data[var]== data[var].max()]
var = 'hp'
data[data[var]== data[var].min()]
data[data[var]== data[var].max()]
var='wt'
data[data[var]== data[var].min()]
data[data[var]== data[var].max()]
var='acc'
data[data[var]== data[var].min()]
data[data[var]== data[var].max()]
Now that we have looked at the distribution of the data along the discrete variables and seen some scatter plots using the seaborn pairplot, let's try to find some logical causes for the variation in mpg. We will use the lmplot() function of seaborn with scatter enabled. This will help us understand the trends in these relations; we can later verify what we see against the correlation heat map to check whether the conclusions drawn are correct. We prefer lmplot() over regplot() for its ability to handle categorical hues better. We will split the regressions by origin country.
var = 'hp'
plot = sns.lmplot(x=var, y='mpg', data=data, hue='Country_code')
plot.set(ylim = (0,50))
var = 'disp'
plot = sns.lmplot(x=var, y='mpg', data=data, hue='Country_code')
plot.set(ylim = (0,50))
var = 'wt'
plot = sns.lmplot(x=var, y='mpg', data=data, hue='Country_code')
plot.set(ylim = (0,50))
var = 'acc'
plot = sns.lmplot(x=var, y='mpg', data=data, hue='Country_code')
plot.set(ylim = (0,50))
data['Power_to_weight'] = ((data.hp*0.7457)/data.wt)
data.sort_values(by='Power_to_weight',ascending=False ).head()
So far, we have looked at our data using various pandas methods and visualized it using the seaborn package. We looked at:
MPG distribution by country of origin AND MPG distribution by number of cylinders
Now that we know what our data looks like, let's use some machine learning models to predict the value of MPG given the values of the factors. We will use Python's scikit-learn to train, test and tune various regression models on our data and compare the results. We shall use the following regression models:
Linear Regression
GBM Regression
data.head()
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
factors = ['cyl','disp','hp','acc','wt','origin','yr']
X = pd.DataFrame(data[factors].copy())
y = data['mpg'].copy()
X = StandardScaler().fit_transform(X)
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size = 0.33,random_state=324)
X_train.shape[0] == y_train.shape[0]
regressor = LinearRegression()
regressor.get_params()
regressor.fit(X_train,y_train)
y_predicted = regressor.predict(X_test)
rmse = sqrt(mean_squared_error(y_true=y_test,y_pred=y_predicted))
rmse
gb_regressor = GradientBoostingRegressor(n_estimators=4000)
gb_regressor.fit(X_train,y_train)
gb_regressor.get_params()
y_predicted_gbr = gb_regressor.predict(X_test)
rmse_bgr = sqrt(mean_squared_error(y_true=y_test,y_pred=y_predicted_gbr))
rmse_bgr
fi= pd.Series(gb_regressor.feature_importances_,index=factors)
fi.plot.barh()
Good, so our initial models work well. However, these metrics were computed on the test set and cannot be used for tuning the model, as that would cause the test data to bleed into the training process. Hence, we will use K-Fold to create cross-validation sets and use grid search to tune the model.
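The idea can be sketched with `cross_val_score`: the scores come from held-out folds of the training data only, so the test set stays untouched for the final unbiased estimate (the data below is synthetic, generated purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for the training split
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Scoring on K held-out folds of the training data keeps the test set
# untouched, so it can still provide an unbiased final estimate
cv = KFold(n_splits=5, shuffle=True, random_state=100)
scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                         scoring='neg_root_mean_squared_error')
print(-scores.mean())  # average RMSE across folds
```

GridSearchCV does exactly this internally for every parameter combination, which is why it can be given the CV splitter directly.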
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(data[factors])
pca.explained_variance_ratio_
pca1 = pca.components_[0]
pca2 = pca.components_[1]
transformed_data = pca.transform(data[factors])
pc1 = transformed_data[:,0]
pc2 = transformed_data[:,1]
plt.scatter(pc1,pc2)
c = pca.inverse_transform(transformed_data[(transformed_data[:,0]>0 )& (transformed_data[:,1]>250)])
factors
c
data[(data['yr'] == 70 )&( data.disp>400)]
The exceptionally far-away point seems to be the Buick Estate Wagon. This seems logical, as the weight given in the dataset appears to be incorrect: the vehicle's weight is listed as 3086 lbs, but research shows the car actually weighs 4727-4775 lbs.
cv_sets = KFold(n_splits=10, shuffle= True,random_state=100)
params = {'n_estimators' : list(range(40,61)),
'max_depth' : list(range(1,10)),
'learning_rate' : [0.1,0.2,0.3] }
grid = GridSearchCV(gb_regressor, params,cv=cv_sets,n_jobs=4)
grid = grid.fit(X_train, y_train)
grid.best_estimator_
gb_regressor_t = grid.best_estimator_
gb_regressor_t.fit(X_train,y_train)
y_predicted_gbr_t = gb_regressor_t.predict(X_test)
rmse = sqrt(mean_squared_error(y_true=y_test,y_pred=y_predicted_gbr_t))
rmse
data.duplicated().any()
from sklearn.cluster import KMeans
from scipy.stats import zscore
data['Country_code'].unique()
dataScaled=data.iloc[:,:-3]
#sns.pairplot(dataScaled,diag_kind='kde')
dataScaled.head()
#Finding optimal no. of clusters
from scipy.spatial.distance import cdist
clusters=range(1,10)
meanDistortions=[]
for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(dataScaled)
    prediction = model.predict(dataScaled)
    meanDistortions.append(sum(np.min(cdist(dataScaled, model.cluster_centers_, 'euclidean'), axis=1)) / dataScaled.shape[0])
plt.plot(clusters, meanDistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Selecting k with the Elbow Method')
After the curve reaches k=4 it begins to flatten, and hence k=4 is the optimal number of clusters to select.
final_model=KMeans(4)
final_model.fit(dataScaled)
prediction=final_model.predict(dataScaled)
#Append the prediction
dataScaled["GROUP"] = prediction
print("Groups Assigned : \n")
dataScaled.head()
Analyze the distribution of the data among the clusters (K = 4). One of the most informative visual tools is the boxplot.
dataKMeansClust = dataScaled.groupby(['GROUP'])
dataKMeansClust.mean()
dataKMeansClust.boxplot(by='GROUP', layout = (2,4),figsize=(10,10))
originalData = dataScaled
dataScaled.head()
# Change categorical data to number 0-2
datac = data.iloc[:,:-2]
datac["Country_code"] = pd.Categorical(datac["Country_code"])
datac["Country_code"] = datac["Country_code"].cat.codes
# Change dataframe to numpy matrix
data1 = datac.values[:, 0:7]
category = datac["Country_code"].values  # encoded country (0-2), matching the three colors used below
datac.head()
# Number of clusters
k = 4
# Number of training data
n = data1.shape[0] - 1
# Number of features in the data
c = data1.shape[1]
# Generate random centers, here we use sigma and mean to ensure it represent the whole data
mean = np.mean(data1, axis = 0)
std = np.std(data1, axis = 0)
centers = np.random.randn(k,c)*std + mean
# Plot the data and the centers generated as random
colors=['orange', 'blue', 'green']
for i in range(n-1):
    plt.scatter(data1[i, 0], data1[i, 1], s=7, color=colors[int(category[i])])
plt.scatter(centers[:,0], centers[:,1], marker='*', c='g', s=150)
from copy import deepcopy  # needed to snapshot the center arrays

centers_old = np.zeros(centers.shape) # to store old centers
centers_new = deepcopy(centers) # to store new centers
data1.shape
clusters = np.zeros(n)
distances = np.zeros((n,k))
error = np.linalg.norm(centers_new - centers_old)
# When, after an update, the estimate of that center stays the same, exit loop
while error != 0:
    # Measure the distance from every point to every current center
    for i in range(k):
        distances[:, i] = np.linalg.norm(data1 - centers_new[i], axis=1)
    # Assign all training data to the closest center
    clusters = np.argmin(distances, axis=1)
    centers_old = deepcopy(centers_new)
    # Calculate the mean of every cluster and update the center
    for i in range(k):
        centers_new[i] = np.mean(data1[clusters == i], axis=0)
    error = np.linalg.norm(centers_new - centers_old)
centers_new
# Plot the data and the centers generated as random
colors=['orange', 'blue', 'green']
for i in range(n):
    plt.scatter(data1[i, 0], data1[i, 1], s=7, color=colors[int(category[i])])
plt.scatter(centers_new[:,0], centers_new[:,1], marker='*', c='g', s=150)
del originalData["GROUP"]
originalDataScaled = originalData.apply(zscore)
originalDataScaled.head()
#importing seaborn for statistical plots
sns.pairplot(originalDataScaled, height=2,aspect=2 , diag_kind='kde')
from sklearn.cluster import AgglomerativeClustering
model = AgglomerativeClustering(n_clusters=4, affinity='euclidean', linkage='average')
model.fit(originalDataScaled)
originalData['labels'] = model.labels_
originalData.head(10)
originalDataClust = originalData.groupby(['labels'])
originalDataClust.mean()
from scipy.cluster.hierarchy import cophenet, dendrogram, linkage
from scipy.spatial.distance import pdist #Pairwise distribution between data points
# cophenet index is a measure of the correlation between the distance of points in feature space and distance on dendrogram
# closer it is to 1, the better is the clustering
Z = linkage(originalDataScaled, metric='euclidean', method='average')
c, coph_dists = cophenet(Z , pdist(originalDataScaled))
c
plt.figure(figsize=(10, 5))
plt.title('Agglomerative Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
dendrogram(Z, leaf_rotation=90.,color_threshold = 40, leaf_font_size=8. )
plt.tight_layout()
# cophenet index is a measure of the correlation between the distance of points in feature space and distance on dendrogram
# closer it is to 1, the better is the clustering
Z = linkage(originalDataScaled, metric='euclidean', method='complete')
c, coph_dists = cophenet(Z , pdist(originalDataScaled))
c
plt.figure(figsize=(10, 5))
plt.title('Agglomerative Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
dendrogram(Z, leaf_rotation=90.,color_threshold=90, leaf_font_size=10. )
plt.tight_layout()
# cophenet index is a measure of the correlation between the distance of points in feature space and distance on dendrogram
# closer it is to 1, the better is the clustering
Z = linkage(originalDataScaled, metric='euclidean', method='ward')
c, coph_dists = cophenet(Z , pdist(originalDataScaled))
c
plt.figure(figsize=(10, 5))
plt.title('Agglomerative Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
dendrogram(Z, leaf_rotation=90.,color_threshold=600, leaf_font_size=10. )
plt.tight_layout()
After examining the cophenetic coefficients of the three dendrograms (linkage method = 'average', 'complete', 'ward'), which are respectively 0.6987370414272304, 0.7385732265612737 and 0.6824190899807201, we observe that the numbers of clusters in them are 4, 10 and 55 respectively. We can also notice that the cophenetic coefficients are very similar for method = 'average' and method = 'complete', with a significant difference in the number of clusters (4 vs. 10), and hence we conclude that k = 4 clusters is optimal.
Advantages:
Disadvantages:
Advantages:
Disadvantages:
Q1. Mention how many optimal clusters are present in the data and what could be the possible reason behind it.
Answer 1: Determining the optimal number of clusters in a data set is a fundamental issue in partitioning clustering, such as k-means, which requires the user to specify the number of clusters k to be generated. Unfortunately, there is no definitive answer: the optimal number of clusters is somewhat subjective and depends on the method used for measuring similarities and on the parameters used for partitioning. A simple and popular approach is to inspect the dendrogram produced by hierarchical clustering to see whether it suggests a particular number of clusters, though this is also subjective. Below, we apply the elbow method and the silhouette coefficient to determine the optimal number of clusters for k-means.
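One such method, scanning the silhouette coefficient over a range of k, can be sketched on synthetic data (the blob centers below are made up for illustration; the true k is 4 by construction):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four well-separated synthetic blobs, so the true number of clusters is 4
centers = [(-5, -5), (-5, 5), (5, -5), (5, 5)]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.8, random_state=42)

# The silhouette score peaks at the best-separated partition; scanning k
# complements the elbow plot's more subjective visual read
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(k, round(scores[k], 3))
```

On this toy data the score is highest at k = 4, matching the construction; on real data the peak is rarely this sharp and should be read alongside the elbow plot.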
#Finding optimal no. of clusters
from scipy.spatial.distance import cdist
clusters=range(1,10)
meanDistortions=[]
for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(dataScaled)
    prediction = model.predict(dataScaled)
    meanDistortions.append(sum(np.min(cdist(dataScaled, model.cluster_centers_, 'euclidean'), axis=1)) / dataScaled.shape[0])
plt.plot(clusters, meanDistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Selecting k with the Elbow Method')
plt.axvline(4)
From the elbow method above and after calculating the silhouette coefficient below, we can observe that the optimum number of clusters is 4.
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
km = KMeans(n_clusters=4, random_state=42)
km.fit_predict(dataScaled)
score = silhouette_score(dataScaled, km.labels_, metric='euclidean')
# Print the score
print('Silhouette Score: %.3f' % score)
Q2. Use linear regression model on different clusters separately and print the coefficients of the models individually
#Bringing back 'GROUP' column
final_model=KMeans(4)
final_model.fit(dataScaled)
prediction=final_model.predict(dataScaled)
#Append the prediction
dataScaled["GROUP"] = prediction
print("Groups Assigned : \n")
dataScaled.head()
data_Grp0 = dataScaled[dataScaled['GROUP'] == 0]
data_Grp1 = dataScaled[dataScaled['GROUP'] == 1]
data_Grp2 = dataScaled[dataScaled['GROUP'] == 2]
data_Grp3 = dataScaled[dataScaled['GROUP'] == 3]
#Creating dataframes according to groups of clusters
X_0 = data_Grp0[['cyl','disp','hp','wt','acc','yr','origin']]
X_1 = data_Grp1[['cyl','disp','hp','wt','acc','yr','origin']]
X_2 = data_Grp2[['cyl','disp','hp','wt','acc','yr','origin']]
X_3 = data_Grp3[['cyl','disp','hp','wt','acc','yr','origin']]
y_0 = data_Grp0['mpg']
y_1 = data_Grp1['mpg']
y_2 = data_Grp2['mpg']
y_3 = data_Grp3['mpg']
# Train Test Split
# Now let’s split the data into a training set and a testing set.
# We will train our model on the training set and then use the test set to evaluate it.
from sklearn.model_selection import train_test_split
X_0train, X_0test, y_0train, y_0test = train_test_split(X_0, y_0, test_size=0.4, random_state=101)
X_1train, X_1test, y_1train, y_1test = train_test_split(X_1, y_1, test_size=0.4, random_state=101)
X_2train, X_2test, y_2train, y_2test = train_test_split(X_2, y_2, test_size=0.4, random_state=101)
X_3train, X_3test, y_3train, y_3test = train_test_split(X_3, y_3, test_size=0.4, random_state=101)
# Creating and Training the Model
lm0 = LinearRegression()
lm0.fit(X_0train,y_0train)
lm1 = LinearRegression()
lm1.fit(X_1train,y_1train)
lm2 = LinearRegression()
lm2.fit(X_2train,y_2train)
lm3 = LinearRegression()
lm3.fit(X_3train,y_3train)
Let's evaluate the model by checking out its coefficients and how we can interpret them.
print(lm0.intercept_)
coeff_df0 = pd.DataFrame(lm0.coef_,X_0.columns,columns=['Coefficient'])
coeff_df0
'mpg' is predicted:
Similarly all other features can be interpreted.
print(lm1.intercept_)
coeff_df1 = pd.DataFrame(lm1.coef_,X_1.columns,columns=['Coefficient'])
coeff_df1
'mpg' is predicted:
Similarly all other features can be interpreted.
print(lm2.intercept_)
coeff_df2 = pd.DataFrame(lm2.coef_,X_2.columns,columns=['Coefficient'])
coeff_df2
'mpg' is predicted:
Similarly all other features can be interpreted.
print(lm3.intercept_)
coeff_df3 = pd.DataFrame(lm3.coef_,X_3.columns,columns=['Coefficient'])
coeff_df3
'mpg' is predicted:
Similarly all other features can be interpreted.
Q3. How using different models for different clusters will be helpful in this case and how it will be different than using one single model without clustering?
Let us see whether, if we use the entire dataset, there is variation in the values of the parameters and coefficients.
X = dataScaled[['cyl','disp','hp','wt','acc','yr','origin']]
y = dataScaled['mpg']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)
lm = LinearRegression()
lm.fit(X_train,y_train)
print(lm.intercept_)
coeff_df = pd.DataFrame(lm.coef_,X.columns,columns=['Coefficient'])
coeff_df
Detailed suggestions and improvements on the quality, quantity, variety, velocity, veracity, etc. of the data points collected by the company, to enable better data analysis in future.
The term “big data” refers to data that is so large, fast or complex that it’s difficult or impossible to process using traditional methods. The act of accessing and storing large amounts of information for analytics has been around a long time.
VOLUME: Organizations collect data from a variety of sources, including business transactions, smart (IoT) devices, industrial equipment, videos, social media and more.
IMPROVEMENT/SUGGESTION: In the past, storing it would have been a problem – but cheaper storage on platforms like data lakes and Hadoop have eased the burden.
VELOCITY: With the growth in the Internet of Things, data streams in to businesses at an unprecedented speed and must be handled in a timely manner.
IMPROVEMENT/SUGGESTION: RFID tags, sensors and smart meters are driving the need to deal with these torrents of data in near-real time.
VARIETY: Data comes in all types of formats – from structured, numeric data in traditional databases to unstructured text documents, emails, videos, audios, stock ticker data and financial transactions.
IMPROVEMENT/SUGGESTION: Have or generate a unified platform to collect data from all sources would be very beneficial for analysis and interpretation.
VERACITY: Veracity refers to the quality of data. Because data comes from so many different sources, it’s difficult to link, match, cleanse and transform data across systems.
IMPROVEMENT/SUGGESTION: Businesses need to connect and correlate relationships, hierarchies and multiple data linkages. Otherwise, their data can quickly spiral out of control.
VARIABILITY: In addition to the increasing velocities and varieties of data, data flows are unpredictable, changing often and varying greatly.
IMPROVEMENT/SUGGESTION: It’s challenging, but businesses need to know when something is trending in social media, and how to manage daily, seasonal and event-triggered peak data loads.
Manufacturing
Company X curates and packages wine across various vineyards spread throughout the country.
The data concerns the chemical composition of the wine and its respective quality.
The goal is to build a synthetic data generation model using the existing data provided by the company. Steps and tasks: [Total Score: 5 points]
#Import necessary packages
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from sklearn.tree import plot_tree
from sklearn.svm import SVC
from sklearn.cluster import KMeans
from os import system
from IPython.display import Image
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
%matplotlib inline
#Loading Dataset
wine = pd.read_csv('Part2 - Company.csv')
#Let's check how the data is distributed
wine.head()
#Information about the data columns
wine.info()
wine.isna().any()
# We have made a new dataframe called wine_new removing the NaN values from 'Quality'
# Reason: To construct a Decision Tree we cannot use NaN values. Those data fields can be inferred after tree is created
wine_new = wine.dropna()
wine_new.head()
wine_new.describe()
wine_new.shape
for feature in wine_new.columns: # Loop through all columns in the dataframe
    if wine_new[feature].dtype == 'object': # Only apply to columns with categorical strings
        wine_new[feature] = pd.Categorical(wine_new[feature]) # Convert strings to a categorical type
wine_new.head(10)
print(wine_new.Quality.value_counts())
wine_new.info()
fig = plt.figure(figsize = (10,6))
sns.barplot(x = 'Quality', y = 'A', data = wine_new)
fig = plt.figure(figsize = (10,6))
sns.barplot(x = 'Quality', y = 'B', data = wine_new)
fig = plt.figure(figsize = (10,6))
sns.barplot(x = 'Quality', y = 'C', data = wine_new)
fig = plt.figure(figsize = (10,6))
sns.barplot(x = 'Quality', y = 'D', data = wine_new)
wine.replace(to_replace = 'Quality A', value=1, inplace=True)
wine.replace(to_replace = 'Quality B', value=0, inplace=True)
wine.head()
X = wine.iloc[:, [0,1,2,3]]
kmeans = KMeans(n_clusters = 2, random_state = 42)
# Compute k-means clustering
kmeans.fit(X)
# Compute cluster centers and predict cluster index for each sample.
pred = kmeans.predict(X)
pred
wine['Cluster_prediction'] = pd.DataFrame(pred, columns=['cluster'] )
print('Number of data points in each cluster= \n', wine['Cluster_prediction'].value_counts())
wine.head(10)
wine_pred = wine.dropna()
wine_pred.head()
# We will now apply a linear model on the dataset wine_pred
from sklearn import linear_model
# Initialize model
regression_model = linear_model.LinearRegression()
# Train the model using the wine data
regression_model.fit(X = pd.DataFrame(wine_pred["Quality"]),
y = wine_pred["Cluster_prediction"])
# Check trained model y-intercept
print(regression_model.intercept_)
# Check trained model coefficients
print(regression_model.coef_)
The output above shows the model intercept and coefficient used to create the best-fit line. In this case the y-intercept is -2.220446049250313e-16 (effectively zero) and the coefficient for the Quality variable is 1. In other words, the model fit the line: Cluster_prediction = Quality.
We can get a sense of how much of the variance in the response variable is explained by the model using the model.score() function:
regression_model.score(X = pd.DataFrame(wine_pred["Quality"]),
y = wine_pred["Cluster_prediction"])
The output of the score function for linear regression is "R-squared", a value that ranges from 0 to 1 which describes the proportion of variance in the response variable that is explained by the model. In this case, Quality explains roughly 100% of the variance in Cluster_prediction.
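As a quick sanity check of what score() returns, a sketch on a toy relation where the predictor fully determines the response (the numbers are made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: y is exactly equal to x, so the fit is perfect
x = np.array([[0.0], [1.0], [0.0], [1.0]])
y = np.array([0.0, 1.0, 0.0, 1.0])
model = LinearRegression().fit(x, y)
print(model.score(x, y))  # 1.0 -- all variance in y is explained
```

An R-squared this close to 1 on real data, as seen above, signals that the two columns are essentially the same variable.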
Since there are only 61 values and our R² value is very high, we can impute the missing values (18 values) in the Quality column with those found in the Cluster_prediction column.
wine.head()
#Imputing missing values and then dropping Cluster_prediction column
wine['Quality'] = wine['Quality'].combine_first(wine['Cluster_prediction'])
del wine['Cluster_prediction']
wine
Automobile
The purpose is to classify a given silhouette as one of three types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.
The data contains features extracted from the silhouettes of vehicles at different angles. Four "Corgi" model vehicles were used for the experiment: a double-decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. This particular combination of vehicles was chosen with the expectation that the bus, the van and either one of the cars would be readily distinguishable, but that it would be more difficult to distinguish between the cars.
• All the features are numeric, i.e. geometric features extracted from the silhouette.
Apply dimensionality reduction technique – PCA and train a model using principal components instead of training the model using just the raw data.
#import the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score,confusion_matrix
from scipy.stats import zscore
from sklearn.model_selection import train_test_split
#load the csv file and make the data frame
vehicle_df = pd.read_csv('Part3 - vehicle.csv')
#display the first 5 rows of dataframe
vehicle_df.head()
print("The dataframe has {} rows and {} columns".format(vehicle_df.shape[0],vehicle_df.shape[1]))
#display the information of dataframe
vehicle_df.info()
From the above we can see that, except for the 'class' column, all columns are numeric, and there are null values in some columns. The class column is our target column.
#display in each column how many null values are there
vehicle_df.apply(lambda x: sum(x.isnull()))
From the above we can see that the maximum number of null values is 6, in the two columns 'radius_ratio' and 'skewness_about'. So we have two options: either drop those null values or impute them. Dropping null values is not ideal because we lose information, but we will try both options and see the effect on the model.
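A minimal sketch of the imputation option, using the column median (the values below are hypothetical stand-ins for a column like 'radius_ratio'):

```python
import numpy as np
import pandas as pd

# Hypothetical feature column with a couple of missing values
col = pd.Series([140.0, 150.0, np.nan, 160.0, np.nan])

# Median imputation keeps every row and is robust to outliers,
# at the cost of slightly shrinking the column's variance
filled = col.fillna(col.median())
print(filled.isna().sum())  # 0
```

The median is usually preferred over the mean here because several of these columns have long tails, as the pair plots below suggest.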
#display 5 point summary of dataframe
vehicle_df.describe().transpose()
sns.pairplot(vehicle_df,diag_kind='kde')
plt.show()
From the above pair plots we can see that many columns are correlated and many have long tails, which is an indication of outliers. We will check down the line, with the help of the correlation matrix, the strength of the correlations and whether outliers are present.
As noted above, the data has missing values in some columns, so before building any model we have to handle them. We have two options: drop the rows with missing values, or impute them. We will go with both options and see the effect on the model. First we will drop the missing values. Before doing so we will copy the original dataframe into a new one; it is good practice to keep the original dataframe untouched and make all modifications on the copy.
#copy the dataframe to another dataframe and drop null/missing values from the newly created dataframe
new_vehicle_df = vehicle_df.copy()
# so now we have new dataframe called new_vehicle_df and we will make changes in this new dataframe.
#display the first 5 rows of new dataframe
new_vehicle_df.head()
#display the shape of dataframe
print("Shape of newly created dataframe:",new_vehicle_df.shape)
#drop the null values from the new dataframe
new_vehicle_df.dropna(axis=0,inplace=True)
#now we will see what is the shape of dataframe
print("After dropping missing values shape of dataframe:",new_vehicle_df.shape)
#display 5 point summary of new dataframe
new_vehicle_df.describe().transpose()
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(new_vehicle_df['compactness'],ax=ax1)
ax1.set_title("Distribution Plot")
sns.boxplot(new_vehicle_df['compactness'],ax=ax2)
ax2.set_title("Box Plot")
From the above we can see that there are no outliers in the 'compactness' column and it looks approximately normally distributed.
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(new_vehicle_df['circularity'],ax=ax1)
ax1.set_title("Distribution Plot")
sns.boxplot(new_vehicle_df['circularity'],ax=ax2)
ax2.set_title("Box Plot")
From the above we can see that there are no outliers in the 'circularity' column and it looks approximately normally distributed.
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(new_vehicle_df['distance_circularity'],ax=ax1)
ax1.set_title("Distribution Plot")
sns.boxplot(new_vehicle_df['distance_circularity'],ax=ax2)
ax2.set_title("Box Plot")
From the above we can see that there are no outliers in the 'distance_circularity' column, but the distribution plot shows two peaks and right skewness, since the long tail is on the right side (mean > median).
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(new_vehicle_df['radius_ratio'],ax=ax1)
ax1.set_title("Distribution Plot")
sns.boxplot(new_vehicle_df['radius_ratio'],ax=ax2)
ax2.set_title("Box Plot")
From the above we can see that there are outliers in the 'radius_ratio' column, and there is right skewness, since the long tail is on the right side (mean > median).
#check how many outliers are there in radius_ratio column
q1 = np.quantile(new_vehicle_df['radius_ratio'],0.25)
q2 = np.quantile(new_vehicle_df['radius_ratio'],0.50)
q3 = np.quantile(new_vehicle_df['radius_ratio'],0.75)
IQR = q3-q1
print("Quartile 1::",q1)
print("Quartile 2::",q2)
print("Quartile 3::",q3)
print("Inter Quartile Range::",IQR)
#outliers = q3 + 1.5*IQR, q1 - 1.5*IQR
print("radius_ratio above",new_vehicle_df['radius_ratio'].quantile(0.75)+(1.5 * IQR),"are outliers")
print("The Outliers in radius_ratio column are",new_vehicle_df[new_vehicle_df['radius_ratio']>276]['radius_ratio'].shape[0])
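The same quartile arithmetic is repeated below for several columns, so it can also be wrapped once in a small helper. A sketch with illustrative names, applicable to any numeric column:

```python
import pandas as pd

def iqr_outlier_bounds(s: pd.Series):
    """Return the Tukey fences (q1 - 1.5*IQR, q3 + 1.5*IQR) for a numeric series."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def count_iqr_outliers(s: pd.Series) -> int:
    """Count values outside the Tukey fences on either side."""
    lo, hi = iqr_outlier_bounds(s)
    return int(((s < lo) | (s > hi)).sum())
```

With this, `count_iqr_outliers(new_vehicle_df['radius_ratio'])` would give the same count as the hard-coded comparison above.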
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(new_vehicle_df['pr.axis_aspect_ratio'],ax=ax1)
ax1.set_title("Distribution Plot")
sns.boxplot(new_vehicle_df['pr.axis_aspect_ratio'],ax=ax2)
ax2.set_title("Box Plot")
From the above we can see that there are outliers in the 'pr.axis_aspect_ratio' column, and there is right skewness, since the long tail is on the right side (mean > median).
#check how many outliers are there in pr.axis_aspect_ratio column
q1 = np.quantile(new_vehicle_df['pr.axis_aspect_ratio'],0.25)
q2 = np.quantile(new_vehicle_df['pr.axis_aspect_ratio'],0.50)
q3 = np.quantile(new_vehicle_df['pr.axis_aspect_ratio'],0.75)
IQR = q3-q1
print("Quartile 1::",q1)
print("Quartile 2::",q2)
print("Quartile 3::",q3)
print("Inter Quartile Range::",IQR)
#outliers = q3 + 1.5*IQR, q1 - 1.5*IQR
print("pr.axis_aspect_ratio above",new_vehicle_df['pr.axis_aspect_ratio'].quantile(0.75)+(1.5 * IQR),"are outliers")
print("The Outliers in pr.axis_aspect_ratio column are",new_vehicle_df[new_vehicle_df['pr.axis_aspect_ratio']>77]['pr.axis_aspect_ratio'].shape[0])
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(new_vehicle_df['max.length_aspect_ratio'],ax=ax1)
ax1.set_title("Distribution Plot")
sns.boxplot(new_vehicle_df['max.length_aspect_ratio'],ax=ax2)
ax2.set_title("Box Plot")
From the above we can see that there are outliers in the 'max.length_aspect_ratio' column, and there is right skewness, since the long tail is on the right side (mean > median).
#check how many outliers are there in max.length_aspect_ratio column
q1 = np.quantile(new_vehicle_df['max.length_aspect_ratio'],0.25)
q2 = np.quantile(new_vehicle_df['max.length_aspect_ratio'],0.50)
q3 = np.quantile(new_vehicle_df['max.length_aspect_ratio'],0.75)
IQR = q3-q1
print("Quartile 1::",q1)
print("Quartile 2::",q2)
print("Quartile 3::",q3)
print("Inter Quartile Range::",IQR)
#outliers = q3 + 1.5*IQR, q1 - 1.5*IQR
print("max.length_aspect_ratio above",new_vehicle_df['max.length_aspect_ratio'].quantile(0.75)+(1.5 * IQR),"are outliers")
print("max.length_aspect_ratio below",new_vehicle_df['max.length_aspect_ratio'].quantile(0.25)-(1.5 * IQR),"are outliers")
print("The above Outliers in max.length_aspect_ratio column are",new_vehicle_df[new_vehicle_df['max.length_aspect_ratio']>14.5]['max.length_aspect_ratio'].shape[0])
print("The below Outliers in max.length_aspect_ratio column are",new_vehicle_df[new_vehicle_df['max.length_aspect_ratio']<2.5]['max.length_aspect_ratio'].shape[0])
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(new_vehicle_df['scatter_ratio'],ax=ax1)
ax1.set_title("Distribution Plot")
sns.boxplot(new_vehicle_df['scatter_ratio'],ax=ax2)
ax2.set_title("Box Plot")
From the above we can see that there are no outliers in the 'scatter_ratio' column; the distribution plot has two peaks and there is right skewness, since the long tail is on the right side (mean > median).
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(new_vehicle_df['elongatedness'],ax=ax1)
ax1.set_title("Distribution Plot")
sns.boxplot(new_vehicle_df['elongatedness'],ax=ax2)
ax2.set_title("Box Plot")
From the above we can see that there are no outliers in the 'elongatedness' column; the distribution plot has two peaks and there is left skewness, since the long tail is on the left side (mean < median).
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(new_vehicle_df['pr.axis_rectangularity'],ax=ax1)
ax1.set_title("Distribution Plot")
sns.boxplot(new_vehicle_df['pr.axis_rectangularity'],ax=ax2)
ax2.set_title("Box Plot")
From the above we can see that there are no outliers in the 'pr.axis_rectangularity' column; the distribution plot has two peaks and there is right skewness, since the long tail is on the right side (mean > median).
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(new_vehicle_df['max.length_rectangularity'],ax=ax1)
ax1.set_title("Distribution Plot")
sns.boxplot(new_vehicle_df['max.length_rectangularity'],ax=ax2)
ax2.set_title("Box Plot")
From the above we can see that there are no outliers in the 'max.length_rectangularity' column; the distribution plot has two peaks and there is right skewness, since the long tail is on the right side (mean > median).
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(new_vehicle_df['scaled_variance'],ax=ax1)
ax1.set_title("Distribution Plot")
sns.boxplot(new_vehicle_df['scaled_variance'],ax=ax2)
ax2.set_title("Box Plot")
From the above we can see that there are outliers in the 'scaled_variance' column; the distribution plot has two peaks and there is right skewness, since the long tail is on the right side (mean > median).
#check how many outliers are there in scaled_variance column
q1 = np.quantile(new_vehicle_df['scaled_variance'],0.25)
q2 = np.quantile(new_vehicle_df['scaled_variance'],0.50)
q3 = np.quantile(new_vehicle_df['scaled_variance'],0.75)
IQR = q3-q1
print("Quartile 1::",q1)
print("Quartile 2::",q2)
print("Quartile 3::",q3)
print("Inter Quartile Range::",IQR)
#outliers = q3 + 1.5*IQR, q1 - 1.5*IQR
print("scaled_variance above",new_vehicle_df['scaled_variance'].quantile(0.75)+(1.5 * IQR),"are outliers")
print("The Outliers in scaled_variance column are",new_vehicle_df[new_vehicle_df['scaled_variance']>292]['scaled_variance'].shape[0])
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(new_vehicle_df['scaled_variance.1'],ax=ax1)
ax1.set_title("Distribution Plot")
sns.boxplot(new_vehicle_df['scaled_variance.1'],ax=ax2)
ax2.set_title("Box Plot")
From the above we can see that there are outliers in the 'scaled_variance.1' column; the distribution plot has two peaks and there is right skewness, since the long tail is on the right side (mean > median).
#check how many outliers are there in scaled_variance.1 column
q1 = np.quantile(new_vehicle_df['scaled_variance.1'],0.25)
q2 = np.quantile(new_vehicle_df['scaled_variance.1'],0.50)
q3 = np.quantile(new_vehicle_df['scaled_variance.1'],0.75)
IQR = q3-q1
print("Quartile 1::",q1)
print("Quartile 2::",q2)
print("Quartile 3::",q3)
print("Inter Quartile Range::",IQR)
#outliers = q3 + 1.5*IQR, q1 - 1.5*IQR
print("scaled_variance.1 above",new_vehicle_df['scaled_variance.1'].quantile(0.75)+(1.5 * IQR),"are outliers")
print("The Outliers in scaled_variance.1 column are",new_vehicle_df[new_vehicle_df['scaled_variance.1']>988]['scaled_variance.1'].shape[0])
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(new_vehicle_df['scaled_radius_of_gyration'],ax=ax1)
ax1.set_title("Distribution Plot")
sns.boxplot(new_vehicle_df['scaled_radius_of_gyration'],ax=ax2)
ax2.set_title("Box Plot")
From the above we can see that there are no outliers in the 'scaled_radius_of_gyration' column, but there is right skewness, since the long tail is on the right side (mean > median).
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(new_vehicle_df['scaled_radius_of_gyration.1'],ax=ax1)
ax1.set_title("Distribution Plot")
sns.boxplot(new_vehicle_df['scaled_radius_of_gyration.1'],ax=ax2)
ax2.set_title("Box Plot")
From the above we can see that there are outliers in the 'scaled_radius_of_gyration.1' column, and there is right skewness, since the long tail is on the right side (mean > median).
#check how many outliers are there in scaled_radius_of_gyration.1 column
q1 = np.quantile(new_vehicle_df['scaled_radius_of_gyration.1'],0.25)
q2 = np.quantile(new_vehicle_df['scaled_radius_of_gyration.1'],0.50)
q3 = np.quantile(new_vehicle_df['scaled_radius_of_gyration.1'],0.75)
IQR = q3-q1
print("Quartile 1::",q1)
print("Quartile 2::",q2)
print("Quartile 3::",q3)
print("Inter Quartile Range::",IQR)
#outliers = q3 + 1.5*IQR, q1 - 1.5*IQR
print("scaled_radius_of_gyration.1 above",new_vehicle_df['scaled_radius_of_gyration.1'].quantile(0.75)+(1.5 * IQR),"are outliers")
print("The Outliers in scaled_radius_of_gyration.1 column are",new_vehicle_df[new_vehicle_df['scaled_radius_of_gyration.1']>87]['scaled_radius_of_gyration.1'].shape[0])
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(new_vehicle_df['skewness_about'],ax=ax1)
ax1.set_title("Distribution Plot")
sns.boxplot(new_vehicle_df['skewness_about'],ax=ax2)
ax2.set_title("Box Plot")
From the above we can see that there are outliers in the 'skewness_about' column, and there is right skewness, since the long tail is on the right side (mean > median).
#check how many outliers are there in skewness_about column
q1 = np.quantile(new_vehicle_df['skewness_about'],0.25)
q2 = np.quantile(new_vehicle_df['skewness_about'],0.50)
q3 = np.quantile(new_vehicle_df['skewness_about'],0.75)
IQR = q3-q1
print("Quartile 1::",q1)
print("Quartile 2::",q2)
print("Quartile 3::",q3)
print("Inter Quartile Range::",IQR)
#outliers = q3 + 1.5*IQR, q1 - 1.5*IQR
print("skewness_about above",new_vehicle_df['skewness_about'].quantile(0.75)+(1.5 * IQR),"are outliers")
print("The Outliers in skewness_about column are",new_vehicle_df[new_vehicle_df['skewness_about']>19.5]['skewness_about'].shape[0])
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(new_vehicle_df['skewness_about.1'],ax=ax1)
ax1.set_title("Distribution Plot")
sns.boxplot(new_vehicle_df['skewness_about.1'],ax=ax2)
ax2.set_title("Box Plot")
From the above we can see that there are outliers in the 'skewness_about.1' column, and there is right skewness, since the long tail is on the right side (mean > median).
#check how many outliers are there in skewness_about.1 column
q1 = np.quantile(new_vehicle_df['skewness_about.1'],0.25)
q2 = np.quantile(new_vehicle_df['skewness_about.1'],0.50)
q3 = np.quantile(new_vehicle_df['skewness_about.1'],0.75)
IQR = q3-q1
print("Quartile 1::",q1)
print("Quartile 2::",q2)
print("Quartile 3::",q3)
print("Inter Quartile Range::",IQR)
#outliers = q3 + 1.5*IQR, q1 - 1.5*IQR
print("skewness_about.1 above",new_vehicle_df['skewness_about.1'].quantile(0.75)+(1.5 * IQR),"are outliers")
print("The Outliers in skewness_about.1 column are",new_vehicle_df[new_vehicle_df['skewness_about.1']>38.5]['skewness_about.1'].shape[0])
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(new_vehicle_df['skewness_about.2'],ax=ax1)
ax1.set_title("Distribution Plot")
sns.boxplot(new_vehicle_df['skewness_about.2'],ax=ax2)
ax2.set_title("Box Plot")
From the above we can see that there are no outliers in the 'skewness_about.2' column, and there is left skewness, since the long tail is on the left side (mean < median).
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(new_vehicle_df['hollows_ratio'],ax=ax1)
ax1.set_title("Distribution Plot")
sns.boxplot(new_vehicle_df['hollows_ratio'],ax=ax2)
ax2.set_title("Box Plot")
From the above we can see that there are no outliers in the 'hollows_ratio' column, and there is left skewness, since the long tail is on the left side (mean < median).
#display how many are car,bus,van.
new_vehicle_df['class'].value_counts()
sns.countplot(new_vehicle_df['class'])
plt.show()
From the above we can see that cars are the most frequent class, followed by buses and then vans.
So by now we have analyzed each column and found outliers in some of them. The next step is to decide whether these outliers are natural or artificial: if natural we do nothing, but if artificial we have to handle them. We found outliers in 8 columns: radius_ratio, pr.axis_aspect_ratio, max.length_aspect_ratio, scaled_variance, scaled_variance.1, scaled_radius_of_gyration.1, skewness_about and skewness_about.1.
After looking at the maximum values of those columns, the outliers appear to be natural rather than typos or artifacts. Note: this is only an assumption, as there is no definitive way to prove whether these outliers are natural or artificial. Since most algorithms are affected by outliers, and we will apply SVM to this data, it is better to drop them.
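The per-column drops below use hard-coded cutoffs read off from the earlier IQR printouts. The same logic can also be written generically; a sketch with illustrative names, shown here on toy data rather than the vehicle dataframe:

```python
import pandas as pd

def drop_iqr_outliers(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Drop rows that fall outside the Tukey fences of any of the given columns."""
    out = df.copy()
    for col in columns:
        q1, q3 = out[col].quantile(0.25), out[col].quantile(0.75)
        iqr = q3 - q1
        lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        out = out[(out[col] >= lo) & (out[col] <= hi)]
    return out

# toy example: the value 100 is far outside the fences of column 'a'
toy = pd.DataFrame({'a': [1, 2, 3, 4, 100], 'b': [5, 6, 7, 8, 9]})
cleaned = drop_iqr_outliers(toy, ['a'])
```

Note that because each column's fences are recomputed after the previous column's rows are dropped, the order of `columns` can slightly change the result, just as the order of the manual drops below can.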
#radius_ratio column outliers
new_vehicle_df.drop(new_vehicle_df[new_vehicle_df['radius_ratio']>276].index,axis=0,inplace=True)
#pr.axis_aspect_ratio column outliers
new_vehicle_df.drop(new_vehicle_df[new_vehicle_df['pr.axis_aspect_ratio']>77].index,axis=0,inplace=True)
#max.length_aspect_ratio column outliers
new_vehicle_df.drop(new_vehicle_df[new_vehicle_df['max.length_aspect_ratio']>14.5].index,axis=0,inplace=True)
new_vehicle_df.drop(new_vehicle_df[new_vehicle_df['max.length_aspect_ratio']<2.5].index,axis=0,inplace=True)
#scaled_variance column outliers
new_vehicle_df[new_vehicle_df['scaled_variance']>292]
From the above we can see that the outliers in the scaled_variance column have already been removed by the earlier drops.
#scaled_variance.1 column outliers
new_vehicle_df.drop(new_vehicle_df[new_vehicle_df['scaled_variance.1']>988].index,axis=0,inplace=True)
#scaled_radius_of_gyration.1 column outliers
new_vehicle_df.drop(new_vehicle_df[new_vehicle_df['scaled_radius_of_gyration.1']>87].index,axis=0,inplace=True)
#skewness_about column outliers
new_vehicle_df.drop(new_vehicle_df[new_vehicle_df['skewness_about']>19.5].index,axis=0,inplace=True)
#skewness_about.1 column outliers
new_vehicle_df.drop(new_vehicle_df[new_vehicle_df['skewness_about.1']>38.5].index,axis=0,inplace=True)
#now what is the shape of dataframe
print("after removing outliers shape of dataframe:",new_vehicle_df.shape)
#find the correlation between independent variables
plt.figure(figsize=(20,5))
sns.heatmap(new_vehicle_df.corr(),annot=True)
plt.show()
Our objective is to recognize whether an object is a van, a bus or a car based on the input features, so our main assumption is that there is little or no multicollinearity between the features. If two features are highly correlated there is no benefit in keeping both; we can drop one of them. The heatmap gives us the correlation matrix, where we can see which features are highly correlated. From the matrix we can see that scaled_variance.1 and scatter_ratio have a correlation of 1, and many other pairs exceed 0.9. So we will drop the columns involved in correlations of ±0.9 or above. There are 8 such columns: max.length_rectangularity, scaled_radius_of_gyration, skewness_about.2, scatter_ratio, elongatedness, pr.axis_rectangularity, scaled_variance and scaled_variance.1.
Now, again, we have two options: drop those eight columns manually, or apply PCA and let it decide how to explain the high-dimensional data with a smaller number of variables. We will try both approaches.
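Instead of reading the heatmap by eye, the highly correlated pairs can also be listed programmatically. A small sketch (the function name and threshold are illustrative):

```python
import numpy as np
import pandas as pd

def correlated_pairs(df: pd.DataFrame, threshold: float = 0.9):
    """List feature pairs whose absolute Pearson correlation meets the threshold."""
    corr = df.corr().abs()
    # keep only the upper triangle so each pair is reported once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [(a, b, round(float(upper.loc[a, b]), 3))
            for a in upper.index for b in upper.columns
            if pd.notna(upper.loc[a, b]) and upper.loc[a, b] >= threshold]
```

Running `correlated_pairs(new_vehicle_df.drop('class', axis=1))` would list the pairs behind the eight columns identified above.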
Principal Component Analysis is an unsupervised statistical technique used to explain high-dimensional data with a small number of variables called principal components, which are linear combinations of the original variables. Its big disadvantage is interpretability: a model built on principal components becomes something of a black box. In PCA we first compute the covariance matrix, then find its eigenvectors and eigenvalues (each eigenvector has a corresponding eigenvalue). We then sort the eigenvectors by decreasing eigenvalue and keep the k eigenvectors with the largest eigenvalues.
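The eigen-decomposition described above can be sketched directly in NumPy. This uses small hypothetical 2-D data rather than the vehicle dataset; the ratios it produces correspond to scikit-learn's `explained_variance_ratio_`:

```python
import numpy as np

# hypothetical correlated 2-D data, just to illustrate the mechanics
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
X = (X - X.mean(axis=0)) / X.std(axis=0)      # z-score, as done with zscore above

cov = np.cov(X, rowvar=False)                 # covariance matrix of the scaled data
eigvals, eigvecs = np.linalg.eigh(cov)        # eigen-decomposition (symmetric matrix)
order = np.argsort(eigvals)[::-1]             # sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained_ratio = eigvals / eigvals.sum()     # fraction of variance per component
scores = X @ eigvecs                          # projections onto the principal axes
```

Keeping the first k columns of `scores` is exactly what `PCA(n_components=k).transform` does (up to component sign).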
#now separate the dataframe into dependent and independent variables
new_vehicle_df_independent_attr = new_vehicle_df.drop('class',axis=1)
new_vehicle_df_dependent_attr = new_vehicle_df['class']
print("shape of new_vehicle_df_independent_attr::",new_vehicle_df_independent_attr.shape)
print("shape of new_vehicle_df_dependent_attr::",new_vehicle_df_dependent_attr.shape)
#now scale the independent attributes and encode the dependent attribute as numbers
new_vehicle_df_independent_attr_scaled = new_vehicle_df_independent_attr.apply(zscore)
new_vehicle_df_dependent_attr.replace({'car':0,'bus':1,'van':2},inplace=True)
#build the covariance matrix; we have 18 independent features so our covariance matrix is 18x18
cov_matrix = np.cov(new_vehicle_df_independent_attr_scaled,rowvar=False)
print("cov_matrix shape:",cov_matrix.shape)
print("Covariance_matrix",cov_matrix)
#now find the eigenvalues and eigenvectors (PCA performs this decomposition internally)
pca_to_learn_variance = PCA(n_components=18)
pca_to_learn_variance.fit(new_vehicle_df_independent_attr_scaled)
#display explained variance ratio
pca_to_learn_variance.explained_variance_ratio_
#display explained variance
pca_to_learn_variance.explained_variance_
#display principal components
pca_to_learn_variance.components_
plt.bar(list(range(1,19)),pca_to_learn_variance.explained_variance_ratio_)
plt.xlabel("eigen value/components")
plt.ylabel("variation explained")
plt.show()
plt.step(list(range(1,19)),np.cumsum(pca_to_learn_variance.explained_variance_ratio_))
plt.xlabel("eigen value/components")
plt.ylabel("cumulative variation explained")
plt.show()
From the above we can see that 8 components are able to explain 95% of the variance in the data, so we will use the first 8 principal components.
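As an aside, scikit-learn's PCA also accepts a fractional `n_components`, in which case it keeps the smallest number of components whose cumulative explained variance reaches that fraction, so the cutoff need not be read off the plot by hand. A small sketch on toy data:

```python
import numpy as np
from sklearn.decomposition import PCA

# toy stand-in for the scaled 18-feature matrix, with one deliberately redundant column
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 18))
X[:, 1] = X[:, 0] * 2 + rng.normal(scale=0.01, size=100)

pca = PCA(n_components=0.95)   # keep enough components for 95% of the variance
X_reduced = pca.fit_transform(X)
print(pca.n_components_, X_reduced.shape)
```

On the vehicle data this would select the same 8 components chosen above.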
#use first 8 principal components
pca_eight_components = PCA(n_components=8)
pca_eight_components.fit(new_vehicle_df_independent_attr_scaled)
#transform the raw data from 18 dimensions into 8 new dimensions with pca
new_vehicle_df_pca_independent_attr = pca_eight_components.transform(new_vehicle_df_independent_attr_scaled)
#display the shape of new_vehicle_df_pca_independent_attr
new_vehicle_df_pca_independent_attr.shape
Before using the 8-dimensional PCA representation, which explains more than 95% of the variance, we will first build a model on the raw data; then we will build a model on the PCA data and compare the two.
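One caveat about the workflow above: the scaler and PCA were fit on the full dataset before the train/test split, which leaks a little information from the test set. Wrapping the steps in a scikit-learn Pipeline fits them on the training fold only. A sketch on stand-in data (the shapes mirror the vehicle data; the labels are random here):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# stand-in for the unscaled 18-feature matrix and the encoded 'class' labels
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 18))
y = rng.integers(0, 3, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)

pipe = Pipeline([
    ('scale', StandardScaler()),   # z-scoring fit on the training fold only
    ('pca', PCA(n_components=8)),  # same 8-component reduction as above
    ('svc', SVC()),
])
pipe.fit(X_train, y_train)
pred = pipe.predict(X_test)
```

The accuracy numbers reported below would change only marginally, but this version is the safer pattern for cross-validation.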
#now split the data into 80:20 ratio
rawdata_X_train,rawdata_X_test,rawdata_y_train,rawdata_y_test = train_test_split(new_vehicle_df_independent_attr_scaled,new_vehicle_df_dependent_attr,test_size=0.20,random_state=1)
pca_X_train,pca_X_test,pca_y_train,pca_y_test = train_test_split(new_vehicle_df_pca_independent_attr,new_vehicle_df_dependent_attr,test_size=0.20,random_state=1)
print("shape of rawdata_X_train",rawdata_X_train.shape)
print("shape of rawdata_y_train",rawdata_y_train.shape)
print("shape of rawdata_X_test",rawdata_X_test.shape)
print("shape of rawdata_y_test",rawdata_y_test.shape)
print("--------------------------------------------")
print("shape of pca_X_train",pca_X_train.shape)
print("shape of pca_y_train",pca_y_train.shape)
print("shape of pca_X_test",pca_X_test.shape)
print("shape of pca_y_test",pca_y_test.shape)
#now we will train the model with both raw data and pca data with new dimension
svc = SVC() #instantiate the object
#fit the model on raw data
svc.fit(rawdata_X_train,rawdata_y_train)
#predict the y value
rawdata_y_predict = svc.predict(rawdata_X_test)
#now fit the model on pca data with new dimension
svc.fit(pca_X_train,pca_y_train)
#predict the y value
pca_y_predict = svc.predict(pca_X_test)
#display accuracy score of both models
print("Accuracy score with raw data(18 dimension)",accuracy_score(rawdata_y_test,rawdata_y_predict))
print("Accuracy score with pca data(8 dimension)",accuracy_score(pca_y_test,pca_y_predict))
From the above we can see that even after reducing the data by 10 dimensions we still achieve about 94% accuracy.
#display confusion matrix of both models
print("Confusion matrix with raw data(18 dimension)\n",confusion_matrix(rawdata_y_test,rawdata_y_predict))
print("Confusion matrix with pca data(8 dimension)\n",confusion_matrix(pca_y_test,pca_y_predict))
#drop the columns
new_vehicle_df_independent_attr_scaled.drop(['max.length_rectangularity','scaled_radius_of_gyration','skewness_about.2','scatter_ratio','elongatedness','pr.axis_rectangularity','scaled_variance','scaled_variance.1'],axis=1,inplace=True)
#display the shape of new dataframe
new_vehicle_df_independent_attr_scaled.shape
dropcolumn_X_train,dropcolumn_X_test,dropcolumn_y_train,dropcolumn_y_test = train_test_split(new_vehicle_df_independent_attr_scaled,new_vehicle_df_dependent_attr,test_size=0.20,random_state=1)
print("shape of dropcolumn_X_train",dropcolumn_X_train.shape)
print("shape of dropcolumn_y_train",dropcolumn_y_train.shape)
print("shape of dropcolumn_X_test",dropcolumn_X_test.shape)
print("shape of dropcolumn_y_test",dropcolumn_y_test.shape)
#fit the model on dropcolumn_X_train,dropcolumn_y_train
svc.fit(dropcolumn_X_train,dropcolumn_y_train)
#predict the y value
dropcolumn_y_predict = svc.predict(dropcolumn_X_test)
#display the accuracy score and confusion matrix
print("Accuracy score with dropcolumn data(10 dimension)",accuracy_score(dropcolumn_y_test,dropcolumn_y_predict))
print("Confusion matrix with dropcolumn data(10 dimension)\n",confusion_matrix(dropcolumn_y_test,dropcolumn_y_predict))
First let's create a new dataframe and then we will impute the missing values.
#create a new dataframe
impute_vehicle_df = vehicle_df.copy()
#display the first 5 rows of dataframe
impute_vehicle_df.head()
#display the shape of dataframe
impute_vehicle_df.shape
#display the information of dataframe
impute_vehicle_df.info()
From the above we can see that there are null values in some columns; now we will impute them.
#display 5 point summary
impute_vehicle_df.describe().transpose()
From the 5-point summary it looks like we can impute with the median. By imputing missing values with the median we change the shape of the distribution slightly and introduce some bias, but it may still be better than dropping the rows.
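A minimal sketch of median imputation on a small hypothetical frame, same idea as the fillna call used on the vehicle dataframe:

```python
import numpy as np
import pandas as pd

# hypothetical frame with gaps in two columns
df = pd.DataFrame({'radius_ratio': [120.0, np.nan, 180.0, 200.0],
                   'skewness_about': [5.0, 7.0, np.nan, 9.0]})

medians = df.median()          # per-column medians; NaNs are ignored when computing them
imputed = df.fillna(medians)   # each NaN is replaced by its own column's median
```

Because the fill value is a Series indexed by column name, each column gets its own median; rows with a gap in one column keep their observed values in the others.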
impute_vehicle_df.fillna(impute_vehicle_df.median(),inplace=True)
#display the info of dataframe
impute_vehicle_df.info()
From the above we can see that there are no null values left in any column.
#display 5 point summary after imputation
impute_vehicle_df.describe().transpose()
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(impute_vehicle_df['compactness'],ax=ax1)
ax1.set_title("Distribution Plot")
sns.boxplot(impute_vehicle_df['compactness'],ax=ax2)
ax2.set_title("Box Plot")
From the above we can see that there are no outliers in the 'compactness' column and it looks approximately normally distributed.
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(impute_vehicle_df['circularity'],ax=ax1)
ax1.set_title("Distribution Plot")
sns.boxplot(impute_vehicle_df['circularity'],ax=ax2)
ax2.set_title("Box Plot")
From the above we can see that there are no outliers in the 'circularity' column and it looks approximately normally distributed.
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(impute_vehicle_df['distance_circularity'],ax=ax1)
ax1.set_title("Distribution Plot")
sns.boxplot(impute_vehicle_df['distance_circularity'],ax=ax2)
ax2.set_title("Box Plot")
From the above we can see that there are no outliers in the 'distance_circularity' column, but the distribution plot shows two peaks and right skewness, since the long tail is on the right side (mean > median).
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(impute_vehicle_df['radius_ratio'],ax=ax1)
ax1.set_title("Distribution Plot")
sns.boxplot(impute_vehicle_df['radius_ratio'],ax=ax2)
ax2.set_title("Box Plot")
From the above we can see that there are outliers in the 'radius_ratio' column, and there is right skewness, since the long tail is on the right side (mean > median).
#check how many outliers are there in radius_ratio column
q1 = np.quantile(impute_vehicle_df['radius_ratio'],0.25)
q2 = np.quantile(impute_vehicle_df['radius_ratio'],0.50)
q3 = np.quantile(impute_vehicle_df['radius_ratio'],0.75)
IQR = q3-q1
print("Quartile 1::",q1)
print("Quartile 2::",q2)
print("Quartile 3::",q3)
print("Inter Quartile Range::",IQR)
#outliers = q3 + 1.5*IQR, q1 - 1.5*IQR
print("radius_ratio above",impute_vehicle_df['radius_ratio'].quantile(0.75)+(1.5 * IQR),"are outliers")
print("The Outliers in radius_ratio column are",impute_vehicle_df[impute_vehicle_df['radius_ratio']>276]['radius_ratio'].shape[0])
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(impute_vehicle_df['pr.axis_aspect_ratio'],ax=ax1)
ax1.set_title("Distribution Plot")
sns.boxplot(impute_vehicle_df['pr.axis_aspect_ratio'],ax=ax2)
ax2.set_title("Box Plot")
From the above we can see that there are outliers in the 'pr.axis_aspect_ratio' column, and there is right skewness, since the long tail is on the right side (mean > median).
#check how many outliers are there in pr.axis_aspect_ratio column
q1 = np.quantile(impute_vehicle_df['pr.axis_aspect_ratio'],0.25)
q2 = np.quantile(impute_vehicle_df['pr.axis_aspect_ratio'],0.50)
q3 = np.quantile(impute_vehicle_df['pr.axis_aspect_ratio'],0.75)
IQR = q3-q1
print("Quartile 1::",q1)
print("Quartile 2::",q2)
print("Quartile 3::",q3)
print("Inter Quartile Range::",IQR)
#outliers = q3 + 1.5*IQR, q1 - 1.5*IQR
print("pr.axis_aspect_ratio above",impute_vehicle_df['pr.axis_aspect_ratio'].quantile(0.75)+(1.5 * IQR),"are outliers")
print("The Outliers in pr.axis_aspect_ratio column are",impute_vehicle_df[impute_vehicle_df['pr.axis_aspect_ratio']>77]['pr.axis_aspect_ratio'].shape[0])
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(impute_vehicle_df['max.length_aspect_ratio'],ax=ax1)
ax1.set_title("Distribution Plot")
sns.boxplot(impute_vehicle_df['max.length_aspect_ratio'],ax=ax2)
ax2.set_title("Box Plot")
From the above we can see that there are outliers in the 'max.length_aspect_ratio' column, and there is right skewness, since the long tail is on the right side (mean > median).
#check how many outliers are there in max.length_aspect_ratio column
q1 = np.quantile(impute_vehicle_df['max.length_aspect_ratio'],0.25)
q2 = np.quantile(impute_vehicle_df['max.length_aspect_ratio'],0.50)
q3 = np.quantile(impute_vehicle_df['max.length_aspect_ratio'],0.75)
IQR = q3-q1
print("Quartile 1::",q1)
print("Quartile 2::",q2)
print("Quartile 3::",q3)
print("Inter Quartile Range::",IQR)
#outliers = q3 + 1.5*IQR, q1 - 1.5*IQR
print("max.length_aspect_ratio above",impute_vehicle_df['max.length_aspect_ratio'].quantile(0.75)+(1.5 * IQR),"are outliers")
print("max.length_aspect_ratio below",impute_vehicle_df['max.length_aspect_ratio'].quantile(0.25)-(1.5 * IQR),"are outliers")
print("The above Outliers in max.length_aspect_ratio column are",impute_vehicle_df[impute_vehicle_df['max.length_aspect_ratio']>14.5]['max.length_aspect_ratio'].shape[0])
print("The below Outliers in max.length_aspect_ratio column are",impute_vehicle_df[impute_vehicle_df['max.length_aspect_ratio']<2.5]['max.length_aspect_ratio'].shape[0])
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(impute_vehicle_df['scatter_ratio'],ax=ax1)
ax1.set_title("Distribution Plot")
sns.boxplot(impute_vehicle_df['scatter_ratio'],ax=ax2)
ax2.set_title("Box Plot")
From the above we can see that there are no outliers in the 'scatter_ratio' column; the distribution plot has two peaks and there is right skewness, since the long tail is on the right side (mean > median).
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(impute_vehicle_df['elongatedness'],ax=ax1)
ax1.set_title("Distribution Plot")
sns.boxplot(impute_vehicle_df['elongatedness'],ax=ax2)
ax2.set_title("Box Plot")
From the above we can see that there are no outliers in the 'elongatedness' column; the distribution plot has two peaks and there is left skewness, since the long tail is on the left side (mean < median).
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(impute_vehicle_df['pr.axis_rectangularity'],ax=ax1)
ax1.set_title("Distribution Plot")
sns.boxplot(impute_vehicle_df['pr.axis_rectangularity'],ax=ax2)
ax2.set_title("Box Plot")
From the above we can see that there are no outliers in the 'pr.axis_rectangularity' column; the distribution plot has two peaks and there is right skewness, since the long tail is on the right side (mean > median).
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(impute_vehicle_df['max.length_rectangularity'],ax=ax1)
ax1.set_title("Distribution Plot")
sns.boxplot(impute_vehicle_df['max.length_rectangularity'],ax=ax2)
ax2.set_title("Box Plot")
From the above we can see that there are no outliers in the 'max.length_rectangularity' column; the distribution plot has two peaks and there is right skewness, since the long tail is on the right side (mean > median).
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(impute_vehicle_df['scaled_variance'],ax=ax1)
ax1.set_title("Distribution Plot")
sns.boxplot(impute_vehicle_df['scaled_variance'],ax=ax2)
ax2.set_title("Box Plot")
From the above we can see that there are outliers in the scaled_variance column. The distribution plot has two peaks and shows right skewness, since the long tail is on the right side (mean > median).
#check how many outliers are there in scaled_variance column
q1 = np.quantile(impute_vehicle_df['scaled_variance'],0.25)
q2 = np.quantile(impute_vehicle_df['scaled_variance'],0.50)
q3 = np.quantile(impute_vehicle_df['scaled_variance'],0.75)
IQR = q3-q1
upper_fence = q3 + 1.5*IQR
print("Quartile 1::",q1)
print("Quartile 2::",q2)
print("Quartile 3::",q3)
print("Inter Quartile Range::",IQR)
#outlier fences: above q3 + 1.5*IQR or below q1 - 1.5*IQR
print("scaled_variance values above",upper_fence,"are outliers")
print("The number of outliers in the scaled_variance column is",impute_vehicle_df[impute_vehicle_df['scaled_variance']>upper_fence]['scaled_variance'].shape[0])
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(impute_vehicle_df['scaled_variance.1'],ax=ax1)
ax1.set_title("Distribution Plot")
sns.boxplot(impute_vehicle_df['scaled_variance.1'],ax=ax2)
ax2.set_title("Box Plot")
From the above we can see that there are outliers in the scaled_variance.1 column. The distribution plot has two peaks and shows right skewness, since the long tail is on the right side (mean > median).
#check how many outliers are there in scaled_variance.1 column
q1 = np.quantile(impute_vehicle_df['scaled_variance.1'],0.25)
q2 = np.quantile(impute_vehicle_df['scaled_variance.1'],0.50)
q3 = np.quantile(impute_vehicle_df['scaled_variance.1'],0.75)
IQR = q3-q1
upper_fence = q3 + 1.5*IQR
print("Quartile 1::",q1)
print("Quartile 2::",q2)
print("Quartile 3::",q3)
print("Inter Quartile Range::",IQR)
#outlier fences: above q3 + 1.5*IQR or below q1 - 1.5*IQR
print("scaled_variance.1 values above",upper_fence,"are outliers")
print("The number of outliers in the scaled_variance.1 column is",impute_vehicle_df[impute_vehicle_df['scaled_variance.1']>upper_fence]['scaled_variance.1'].shape[0])
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(impute_vehicle_df['scaled_radius_of_gyration'],ax=ax1)
ax1.set_title("Distribution Plot")
sns.boxplot(impute_vehicle_df['scaled_radius_of_gyration'],ax=ax2)
ax2.set_title("Box Plot")
From the above we can see that there are no outliers in the scaled_radius_of_gyration column, and the distribution shows right skewness, since the long tail is on the right side (mean > median).
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(impute_vehicle_df['scaled_radius_of_gyration.1'],ax=ax1)
ax1.set_title("Distribution Plot")
sns.boxplot(impute_vehicle_df['scaled_radius_of_gyration.1'],ax=ax2)
ax2.set_title("Box Plot")
From the above we can see that there are outliers in the scaled_radius_of_gyration.1 column, and the distribution shows right skewness, since the long tail is on the right side (mean > median).
#check how many outliers are there in scaled_radius_of_gyration.1 column
q1 = np.quantile(impute_vehicle_df['scaled_radius_of_gyration.1'],0.25)
q2 = np.quantile(impute_vehicle_df['scaled_radius_of_gyration.1'],0.50)
q3 = np.quantile(impute_vehicle_df['scaled_radius_of_gyration.1'],0.75)
IQR = q3-q1
upper_fence = q3 + 1.5*IQR
print("Quartile 1::",q1)
print("Quartile 2::",q2)
print("Quartile 3::",q3)
print("Inter Quartile Range::",IQR)
#outlier fences: above q3 + 1.5*IQR or below q1 - 1.5*IQR
print("scaled_radius_of_gyration.1 values above",upper_fence,"are outliers")
print("The number of outliers in the scaled_radius_of_gyration.1 column is",impute_vehicle_df[impute_vehicle_df['scaled_radius_of_gyration.1']>upper_fence]['scaled_radius_of_gyration.1'].shape[0])
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(impute_vehicle_df['skewness_about'],ax=ax1)
ax1.set_title("Distribution Plot")
sns.boxplot(impute_vehicle_df['skewness_about'],ax=ax2)
ax2.set_title("Box Plot")
From the above we can see that there are outliers in the skewness_about column, and the distribution shows right skewness, since the long tail is on the right side (mean > median).
#check how many outliers are there in skewness_about column
q1 = np.quantile(impute_vehicle_df['skewness_about'],0.25)
q2 = np.quantile(impute_vehicle_df['skewness_about'],0.50)
q3 = np.quantile(impute_vehicle_df['skewness_about'],0.75)
IQR = q3-q1
upper_fence = q3 + 1.5*IQR
print("Quartile 1::",q1)
print("Quartile 2::",q2)
print("Quartile 3::",q3)
print("Inter Quartile Range::",IQR)
#outlier fences: above q3 + 1.5*IQR or below q1 - 1.5*IQR
print("skewness_about values above",upper_fence,"are outliers")
print("The number of outliers in the skewness_about column is",impute_vehicle_df[impute_vehicle_df['skewness_about']>upper_fence]['skewness_about'].shape[0])
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(impute_vehicle_df['skewness_about.1'],ax=ax1)
ax1.set_title("Distribution Plot")
sns.boxplot(impute_vehicle_df['skewness_about.1'],ax=ax2)
ax2.set_title("Box Plot")
From the above we can see that there are outliers in the skewness_about.1 column, and the distribution shows right skewness, since the long tail is on the right side (mean > median).
#check how many outliers are there in skewness_about.1 column
q1 = np.quantile(impute_vehicle_df['skewness_about.1'],0.25)
q2 = np.quantile(impute_vehicle_df['skewness_about.1'],0.50)
q3 = np.quantile(impute_vehicle_df['skewness_about.1'],0.75)
IQR = q3-q1
upper_fence = q3 + 1.5*IQR
print("Quartile 1::",q1)
print("Quartile 2::",q2)
print("Quartile 3::",q3)
print("Inter Quartile Range::",IQR)
#outlier fences: above q3 + 1.5*IQR or below q1 - 1.5*IQR
print("skewness_about.1 values above",upper_fence,"are outliers")
print("The number of outliers in the skewness_about.1 column is",impute_vehicle_df[impute_vehicle_df['skewness_about.1']>upper_fence]['skewness_about.1'].shape[0])
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(impute_vehicle_df['skewness_about.2'],ax=ax1)
ax1.set_title("Distribution Plot")
sns.boxplot(impute_vehicle_df['skewness_about.2'],ax=ax2)
ax2.set_title("Box Plot")
From the above we can see that there are no outliers in the skewness_about.2 column, and the distribution shows left skewness, since the long tail is on the left side (mean < median).
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(impute_vehicle_df['hollows_ratio'],ax=ax1)
ax1.set_title("Distribution Plot")
sns.boxplot(impute_vehicle_df['hollows_ratio'],ax=ax2)
ax2.set_title("Box Plot")
From the above we can see that there are no outliers in the hollows_ratio column, and the distribution shows left skewness, since the long tail is on the left side (mean < median).
impute_vehicle_df['class'].value_counts()
sns.countplot(impute_vehicle_df['class'])
plt.show()
From the above we can see that cars are the most common class, followed by buses and then vans.
So by now we have analysed each column and found outliers in some of them. Our next step is to determine whether these outliers are natural or artificial: if natural, we do nothing; if artificial, we have to handle them. We found outliers in 8 columns: radius_ratio, pr.axis_aspect_ratio, max.length_aspect_ratio, scaled_variance, scaled_variance.1, scaled_radius_of_gyration.1, skewness_about and skewness_about.1.
After inspecting the maximum values of these columns, the outliers look natural rather than typos or artificial values. However, since we will apply SVM to this data and SVM is affected by outliers, it is better to drop them.
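The per-column quartile arithmetic repeated above can be wrapped in a small reusable helper. This is a sketch of our own (the function name and the toy data are ours, not part of the project code):

```python
import numpy as np
import pandas as pd

def iqr_outlier_counts(df, columns):
    """Count values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] for each column."""
    counts = {}
    for col in columns:
        q1, q3 = np.quantile(df[col], [0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        counts[col] = int(((df[col] < lower) | (df[col] > upper)).sum())
    return counts

# toy example: 99 is an obvious outlier
demo = pd.DataFrame({'x': [1, 2, 3, 4, 5, 99]})
print(iqr_outlier_counts(demo, ['x']))  # {'x': 1}
```

On the real dataframe this would be called as `iqr_outlier_counts(impute_vehicle_df, impute_vehicle_df.columns.drop('class'))`, replacing the copy-pasted quartile blocks.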
#radius_ratio column outliers
impute_vehicle_df.drop(impute_vehicle_df[impute_vehicle_df['radius_ratio']>276].index,axis=0,inplace=True)
#pr.axis_aspect_ratio column outliers
impute_vehicle_df.drop(impute_vehicle_df[impute_vehicle_df['pr.axis_aspect_ratio']>77].index,axis=0,inplace=True)
#max.length_aspect_ratio column outliers
impute_vehicle_df.drop(impute_vehicle_df[impute_vehicle_df['max.length_aspect_ratio']>14.5].index,axis=0,inplace=True)
impute_vehicle_df.drop(impute_vehicle_df[impute_vehicle_df['max.length_aspect_ratio']<2.5].index,axis=0,inplace=True)
#scaled_variance column outliers
impute_vehicle_df[impute_vehicle_df['scaled_variance']>292]
The empty result above shows that the scaled_variance outliers have already been removed by the earlier row drops.
#scaled_variance.1 column outliers
impute_vehicle_df.drop(impute_vehicle_df[impute_vehicle_df['scaled_variance.1']>989.5].index,axis=0,inplace=True)
#scaled_radius_of_gyration.1 column outliers
impute_vehicle_df.drop(impute_vehicle_df[impute_vehicle_df['scaled_radius_of_gyration.1']>87].index,axis=0,inplace=True)
#skewness_about column outliers
impute_vehicle_df.drop(impute_vehicle_df[impute_vehicle_df['skewness_about']>19.5].index,axis=0,inplace=True)
#skewness_about.1 column outliers
impute_vehicle_df.drop(impute_vehicle_df[impute_vehicle_df['skewness_about.1']>40].index,axis=0,inplace=True)
#display the shape of data frame
print("shape of dataframe after removing outliers:",impute_vehicle_df.shape)
plt.figure(figsize=(20,4))
sns.heatmap(impute_vehicle_df.corr(),annot=True)
plt.show()
Our objective is to recognize whether an object is a van, a bus or a car based on the input features, so a key assumption is that there is little or no multicollinearity between the features. If two features are highly correlated there is no benefit in using both; in that case we can drop one of them. The heatmap gives us the correlation matrix, where we can see which features are highly correlated. From the correlation matrix we can see that many features are highly correlated: scaled_variance.1 and scatter_ratio have a correlation of 1, and several other pairs have correlations above 0.9. We will therefore drop the columns involved in correlations of 0.9 or above in absolute value. There are 8 such columns: max.length_rectangularity, scaled_radius_of_gyration, skewness_about.2, scatter_ratio, elongatedness, pr.axis_rectangularity, scaled_variance and scaled_variance.1.
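Rather than reading the heatmap by eye, the highly correlated pairs can also be listed programmatically. A minimal sketch (the helper name, the default threshold and the toy data are ours):

```python
import pandas as pd

def high_corr_pairs(df, threshold=0.9):
    """Return feature pairs whose absolute Pearson correlation is >= threshold."""
    corr = df.corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] >= threshold:
                pairs.append((cols[i], cols[j], round(corr.iloc[i, j], 3)))
    return pairs

# toy example: a and b are perfectly correlated, c is not correlated with either
demo = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [1, 0, 1, 0]})
print(high_corr_pairs(demo))  # [('a', 'b', 1.0)]
```

Applied as `high_corr_pairs(impute_vehicle_df.drop('class', axis=1))`, this would reproduce the list of eight columns read off the heatmap.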
Now we have two options: drop those eight columns manually, or apply PCA and let it decide how to explain this high-dimensional data with a smaller number of variables. We will try both approaches.
#now separate the dataframe into dependent and independent variables
impute_vehicle_df_independent_attr = impute_vehicle_df.drop('class',axis=1)
impute_vehicle_df_dependent_attr = impute_vehicle_df['class']
print("shape of impute_vehicle_df_independent_attr::",impute_vehicle_df_independent_attr.shape)
print("shape of impute_vehicle_df_dependent_attr::",impute_vehicle_df_dependent_attr.shape)
#now scale the independent attributes and encode the dependent attribute as numbers
impute_vehicle_df_independent_attr_scaled = impute_vehicle_df_independent_attr.apply(zscore)
impute_vehicle_df_dependent_attr.replace({'car':0,'bus':1,'van':2},inplace=True)
#build the covariance matrix; we have 18 independent features, so our covariance matrix is an 18*18 matrix
impute_cov_matrix = np.cov(impute_vehicle_df_independent_attr_scaled,rowvar=False)
print("Impute cov_matrix shape:",impute_cov_matrix.shape)
print("Impute Covariance_matrix",impute_cov_matrix)
#PCA will find the eigenvalues and eigenvectors corresponding to the above covariance matrix
impute_pca_to_learn_variance = PCA(n_components=18)
impute_pca_to_learn_variance.fit(impute_vehicle_df_independent_attr_scaled)
#display explained variance ratio
impute_pca_to_learn_variance.explained_variance_ratio_
#display explained variance
impute_pca_to_learn_variance.explained_variance_
#display principal components
impute_pca_to_learn_variance.components_
plt.bar(list(range(1,19)),impute_pca_to_learn_variance.explained_variance_ratio_)
plt.xlabel("eigen value/components")
plt.ylabel("variation explained")
plt.show()
plt.step(list(range(1,19)),np.cumsum(impute_pca_to_learn_variance.explained_variance_ratio_))
plt.xlabel("eigen value/components")
plt.ylabel("cumulative variation explained")
plt.show()
From the above we can see that 8 dimensions are able to explain 95% of the variance of the data, so we will use the first 8 principal components.
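Scikit-learn can also choose the number of components for a target variance automatically: passing a float between 0 and 1 as n_components keeps just enough components to reach that fraction of explained variance. A sketch on synthetic data (not the vehicle dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# synthetic data: 18 correlated columns built from only 4 independent directions
base = rng.normal(size=(300, 4))
X = base @ rng.normal(size=(4, 18)) + 0.01 * rng.normal(size=(300, 18))

pca = PCA(n_components=0.95)  # keep just enough components for 95% variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape[1])  # far fewer than the original 18 columns
print(pca.explained_variance_ratio_.sum() >= 0.95)  # True by construction
```

The same call on `impute_vehicle_df_independent_attr_scaled` would confirm the 8-component choice made by reading the cumulative variance plot.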
#use first 8 principal components
impute_pca_eight_components = PCA(n_components=8)
impute_pca_eight_components.fit(impute_vehicle_df_independent_attr_scaled)
#transform the imputed raw data from 18 dimensions into the 8 new PCA dimensions
impute_vehicle_df_pca_independent_attr = impute_pca_eight_components.transform(impute_vehicle_df_independent_attr_scaled)
#display the shape of impute_vehicle_df_pca_independent_attr
impute_vehicle_df_pca_independent_attr.shape
Before using the PCA data with 8 dimensions (which explain more than 95% of the variance of the data), we will build a model on the raw data; then we will build a model on the PCA data and compare the two models.
#now split the data into 80:20 ratio
impute_rawdata_X_train,impute_rawdata_X_test,impute_rawdata_y_train,impute_rawdata_y_test = train_test_split(impute_vehicle_df_independent_attr_scaled,impute_vehicle_df_dependent_attr,test_size=0.20,random_state=1)
impute_pca_X_train,impute_pca_X_test,impute_pca_y_train,impute_pca_y_test = train_test_split(impute_vehicle_df_pca_independent_attr,impute_vehicle_df_dependent_attr,test_size=0.20,random_state=1)
print("shape of impute_rawdata_X_train",impute_rawdata_X_train.shape)
print("shape of impute_rawdata_y_train",impute_rawdata_y_train.shape)
print("shape of impute_rawdata_X_test",impute_rawdata_X_test.shape)
print("shape of impute_rawdata_y_test",impute_rawdata_y_test.shape)
print("--------------------------------------------")
print("shape of impute_pca_X_train",impute_pca_X_train.shape)
print("shape of impute_pca_y_train",impute_pca_y_train.shape)
print("shape of impute_pca_X_test",impute_pca_X_test.shape)
print("shape of impute_pca_y_test",impute_pca_y_test.shape)
#fit the model on impute raw data
svc.fit(impute_rawdata_X_train,impute_rawdata_y_train)
#predict the y value
impute_rawdata_y_predict = svc.predict(impute_rawdata_X_test)
#now fit the model on pca data with new dimension
svc.fit(impute_pca_X_train,impute_pca_y_train)
#predict the y value
impute_pca_y_predict = svc.predict(impute_pca_X_test)
#display accuracy score of both models
print("Accuracy score with impute raw data(18 dimension)",accuracy_score(impute_rawdata_y_test,impute_rawdata_y_predict))
print("Accuracy score with impute pca data(8 dimension)",accuracy_score(impute_pca_y_test,impute_pca_y_predict))
#display confusion matrix of both models
print("Confusion matrix with impute raw data(18 dimension)\n",confusion_matrix(impute_rawdata_y_test,impute_rawdata_y_predict))
print("Confusion matrix with impute pca data(8 dimension)\n",confusion_matrix(impute_pca_y_test,impute_pca_y_predict))
#drop the columns
impute_vehicle_df_independent_attr_scaled.drop(['max.length_rectangularity','scaled_radius_of_gyration','skewness_about.2','scatter_ratio','elongatedness','pr.axis_rectangularity','scaled_variance','scaled_variance.1'],axis=1,inplace=True)
#display the shape of new dataframe
impute_vehicle_df_independent_attr_scaled.shape
impute_dropcolumn_X_train,impute_dropcolumn_X_test,impute_dropcolumn_y_train,impute_dropcolumn_y_test = train_test_split(impute_vehicle_df_independent_attr_scaled,impute_vehicle_df_dependent_attr,test_size=0.20,random_state=1)
print("shape of impute_dropcolumn_X_train",impute_dropcolumn_X_train.shape)
print("shape of impute_dropcolumn_y_train",impute_dropcolumn_y_train.shape)
print("shape of impute_dropcolumn_X_test",impute_dropcolumn_X_test.shape)
print("shape of impute_dropcolumn_y_test",impute_dropcolumn_y_test.shape)
#fit the model on dropcolumn_X_train,dropcolumn_y_train
svc.fit(impute_dropcolumn_X_train,impute_dropcolumn_y_train)
#predict the y value
impute_dropcolumn_y_predict = svc.predict(impute_dropcolumn_X_test)
#display the accuracy score and confusion matrix
print("Accuracy score with impute dropcolumn data(10 dimension)",accuracy_score(impute_dropcolumn_y_test,impute_dropcolumn_y_predict))
print("Confusion matrix with impute dropcolumn data(10 dimension)\n",confusion_matrix(impute_dropcolumn_y_test,impute_dropcolumn_y_predict))
From the above we can see that PCA does a very good job. Accuracy with PCA is approximately 94% versus approximately 96% with the raw data, but note that PCA reaches its 94% with only 8 dimensions whereas the raw data has 18. Everything has two sides, though: the disadvantage of PCA is that we cannot interpret the model in terms of the original features; it becomes a black box.
Sports management
Company X is a sports management company for international cricket.
# All necessary packages have been imported for the previous questions
# Hence it is not required here
cricket = pd.read_csv('Part4 - batting_bowling_ipl_bat.csv')
cricket = cricket.dropna()
cricket.head()
cricket.shape
cricket
cricket.info()
Here you can see all the column names, the total number of values, and the type of each value.
All columns in the dataset are numerical variables containing numbers as values.
cricket.describe()
# You can see the descriptive statistics of numerical variables such as total count, mean,
# standard deviation, minimum and maximum values and three quantiles of the data (25%,50%,75%).
cricket.isnull().sum() #checks if there are any missing values
plt.rcParams['figure.figsize'] = (20, 10)
sns.countplot(cricket['Runs'], palette = 'dark')
plt.title('Runs',fontsize = 20)
plt.show()
plt.rcParams['figure.figsize'] = (20, 10)
sns.countplot(cricket['Ave'], palette = 'Set3')
plt.title('Average Runs',fontsize = 20)
plt.show()
plt.rcParams['figure.figsize'] = (20, 10)
sns.countplot(cricket['SR'], palette = 'prism')
plt.title('Strike Rate',fontsize = 20)
plt.show()
plt.rcParams['figure.figsize'] = (20, 10)
sns.countplot(cricket['Fours'], palette = 'dark')
plt.title('No. of Fours Scored',fontsize = 20)
plt.show()
plt.rcParams['figure.figsize'] = (20, 10)
sns.countplot(cricket['Sixes'], palette = 'Set2')
plt.title('No. of Sixes scored',fontsize = 20)
plt.show()
plt.rcParams['figure.figsize'] = (20, 10)
sns.countplot(cricket['HF'], palette = 'Set1')
plt.title('No. of Half Centuries scored',fontsize = 20)
plt.show()
plt.subplots_adjust(left=0.125, bottom=0.1, right=0.9, top=0.9,
                    wspace=0.5, hspace=0.2)
plt.subplot(141)
plt.title('Runs')
sns.violinplot(y='Runs',data=cricket,color='m',linewidth=2)
plt.subplot(142)
plt.title('Average Runs')
sns.violinplot(y='Ave',data=cricket,color='g',linewidth=2)
plt.subplot(143)
plt.title('Strike Rate')
sns.violinplot(y='SR',data=cricket,color='r',linewidth=2)
plt.show()
plt.subplot(151)
plt.title('No. of Fours')
sns.violinplot(y='Fours',data=cricket,color='b',linewidth=2)
plt.subplot(152)
plt.title('No. of Sixes')
sns.violinplot(y='Sixes',data=cricket,color='g',linewidth=2)
plt.subplot(153)
plt.title('No. of Half Centuries')
sns.violinplot(y='HF',data=cricket,color='r',linewidth=2)
plt.show()
sns.FacetGrid(cricket, size=5).map(sns.distplot,"Runs").add_legend()
sns.FacetGrid(cricket, size=5).map(sns.distplot,"Ave").add_legend()
sns.FacetGrid(cricket, size=5).map(sns.distplot,"SR").add_legend()
sns.FacetGrid(cricket, size=5).map(sns.distplot,"Fours").add_legend()
sns.FacetGrid(cricket, size=5).map(sns.distplot,"Sixes").add_legend()
sns.FacetGrid(cricket, size=5).map(sns.distplot,"HF").add_legend()
cricket.describe()
# I have displayed this again simply because all the data in this is shown through the graphs present above
sns.pairplot(cricket, size=3)
plt.show()
Using the Dream 11 format, each attribute of a player is given a particular weight, and a composite score is then computed for each player.
First we have to normalize the data (each of the six attributes is given a score out of 10).
For example: if Chris Gayle has scored 733 runs, and 733 is the maximum number of runs scored by any player, Chris Gayle gets 10/10 in the runs column, and so on.
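The normalization just described (divide each attribute by its column maximum and multiply by 10) can be sketched on hypothetical numbers:

```python
import pandas as pd

# hypothetical mini-dataset: 733 is the maximum of the Runs column
demo = pd.DataFrame({'Name': ['Gayle', 'Other'], 'Runs': [733, 366.5]})
demo['Runs_score'] = demo['Runs'] / demo['Runs'].max() * 10
print(demo['Runs_score'].tolist())  # [10.0, 5.0]
```

The player with the column maximum scores 10/10, and everyone else scores proportionally less; the same line is applied to each of the six attributes below.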
cricket.head()
cricket_drop = cricket.drop(labels=['Name'], axis=1)
cricket_drop.head()
cricket_drop['left_col'] = 1
cricket_drop.head()
cols = cricket_drop.columns.tolist()
cols
#Here we are converting columns of the dataframe to list so it would be easier for us to reshuffle the columns.
#We are going to use cols.insert method
cols.insert(0, cols.pop(cols.index('left_col')))
cols
cricket_drop = cricket_drop.reindex(columns = cols)
# By using cricket_drop.reindex(columns=cols) we reorder the dataframe columns to match the list
# Now we are separating features of our dataframe from the labels.
X = cricket_drop.iloc[:,1:7].values
y = cricket_drop.iloc[:,0].values
X
y
np.shape(X)
np.shape(y)
Standardization refers to shifting the distribution of each attribute to have a mean of zero and a standard deviation of one (unit variance). It is useful to standardize attributes before modelling. Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not look more or less like standard normally distributed data.
from sklearn.preprocessing import StandardScaler
X_std = StandardScaler().fit_transform(X)
mean_vec = np.mean(X_std, axis=0)
cov_mat = (X_std - mean_vec).T.dot((X_std - mean_vec)) / (X_std.shape[0]-1)
print('Covariance matrix \n%s' %cov_mat)
print('NumPy covariance matrix: \n%s' %np.cov(X_std.T))
plt.figure(figsize=(8,8))
sns.heatmap(cov_mat, vmax=1, square=True,annot=True,cmap='cubehelix')
plt.title('Covariance between different features')
This part is the meat of the whole understanding of PCA: we calculate the eigenvalues and eigenvectors of the covariance matrix we just computed.
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)
In order to decide which eigenvector(s) can be dropped without losing too much of the explanatory power of the original features when constructing the lower-dimensional subspace, we need to inspect the corresponding eigenvalues: the eigenvectors with the lowest eigenvalues carry the least information about the variance of the data, so those are the ones that can be dropped.
# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]
# Sort the (eigenvalue, eigenvector) tuples from high to low
eig_pairs.sort(key=lambda x: x[0], reverse=True)
# Visually confirm that the list is correctly sorted by decreasing eigenvalues
print('Eigenvalues in descending order:')
for i in eig_pairs:
    print(i[0])
After sorting the eigenpairs, the next question is "how many principal components are we going to choose for our new feature subspace?" A useful measure is the so-called "explained variance," which can be calculated from the eigenvalues. The explained variance tells us how much information (variance) can be attributed to each of the principal components.
tot = sum(eig_vals)
var_exp = [(i / tot)*100 for i in sorted(eig_vals, reverse=True)]
with plt.style.context('dark_background'):
    plt.figure(figsize=(6, 4))
    plt.bar(range(6), var_exp, alpha=0.5, align='center',
            label='individual explained variance')
    plt.ylabel('Explained variance ratio')
    plt.xlabel('Principal components')
    plt.legend(loc='best')
    plt.tight_layout()
The plot above clearly shows that the maximum variance (around 70%) is explained by the first principal component (labelled 0) alone. The second principal component (labelled 1) carries almost 15% of the information, and the remaining components each carry far less. Together, the first and second components capture around 85% of the explained variance, so the model goes from 6 dimensions down to 2!
Next we construct the projection matrix that will be used to transform the cricket data onto the new feature subspace. Since the 1st and 2nd principal components together carry the maximum amount of information, around 85%, we can drop the other components. Here we reduce the 6-dimensional feature space to a 2-dimensional feature subspace by choosing the "top 2" eigenvectors with the highest eigenvalues to construct our d×k-dimensional eigenvector matrix W.
matrix_w = np.hstack((eig_pairs[0][1].reshape(6,1),
                      eig_pairs[1][1].reshape(6,1)))
print('Matrix W:\n', matrix_w)
In this last step we use the 6×2-dimensional projection matrix W to transform our samples onto the new subspace via the equation Y = X×W.
Y = X_std.dot(matrix_w)
Y
cricket_drop['Runs'] = (cricket_drop['Runs']/cricket_drop['Runs'].max())*10
cricket_drop['Ave'] = (cricket_drop['Ave']/cricket_drop['Ave'].max())*10
cricket_drop['SR'] = (cricket_drop['SR']/cricket_drop['SR'].max())*10
cricket_drop['Fours'] = (cricket_drop['Fours']/cricket_drop['Fours'].max())*10
cricket_drop['Sixes'] = (cricket_drop['Sixes']/cricket_drop['Sixes'].max())*10
cricket_drop['HF'] = (cricket_drop['HF']/cricket_drop['HF'].max())*10
cricket_drop.head()
#weight Runs and Ave by the variance shares of the first two principal components (70 and 15 out of 85)
cricket['Overall Score'] = cricket_drop['Runs']*70/85.0 + cricket_drop['Ave']*15/85.0
cricket.head()
#Assign the grades
def determine_grade(scores):
    if scores >= 8 and scores <= 10:
        return 'Grade A'
    elif scores >= 6 and scores < 8:
        return 'Grade B'
    elif scores >= 4.5 and scores < 6:
        return 'Grade C'
    elif scores >= 2 and scores < 4.5:
        return 'Grade D'
    elif scores >= 0 and scores < 2:
        return 'Grade E'
cricket['grades']=cricket['Overall Score'].apply(determine_grade)
cricket.info()
cricket.head()
cricket['grades'].value_counts().plot.pie(autopct="%1.1f%%")
plt.show()
cricket['Rank'] = cricket['Overall Score'].rank(ascending=False)
cricket.sort_values(by='Rank', ascending=True, inplace=True)
cricket
Yes, it is possible to do the same on multimedia data [images and video] and on text data. An image is a combination of rows of pixels placed one after another, where each pixel value represents an intensity; given multiple images, we can form a matrix by treating each row of pixels as a vector. Working with many images requires a huge amount of storage, so PCA is used to compress them while preserving the data as much as possible. Look at the Python code:
# Importing libraries:
import matplotlib.image as mplib
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
# Reading an image and printing the shape of the image
img = mplib.imread('/content/opengenus_logo.png')
print(img.shape)
plt.imshow(img)
# Reshape the image to 2 dimensions by merging the columns with the colour depth (width x channels)
img_r = np.reshape(img, (img.shape[0], -1))
print(img_r.shape)
# Applying PCA so that it will compress the image, the reduced dimension is shown in the output.
pca = PCA(32).fit(img_r)
img_transformed = pca.transform(img_r)
print(img_transformed.shape)
print(np.sum(pca.explained_variance_ratio_) )
# Retrieving the results of the image after Dimension reduction.
temp = pca.inverse_transform(img_transformed)
print(temp.shape)
temp = np.reshape(temp, img.shape)
print(temp.shape)
plt.imshow(temp)